
Hadoop Learning Path (21): Implementing a Reduce Join with MapReduce (Joining Multiple Files)


MapReduce Join

Joining two datasets, data1 and data2, on a common key is a very common problem. If the data is small enough, the join can be done entirely in memory.

If the data is large, an in-memory join will run out of memory (OOM). A MapReduce join can be used to join large datasets.
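For comparison, here is a minimal sketch of such an in-memory hash join in Java. It is only an illustration: the "::" field delimiter and the choice of the first field as the join key are assumptions made for this example.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InMemoryJoinSketch {

    // Joins data2 against data1 on the first "::"-separated field of each record.
    // Works only while both datasets fit comfortably in memory.
    public static List<String> join(List<String> data1, List<String> data2) {
        // Index the first dataset by its join key
        Map<String, String> index = new HashMap<>();
        for (String line : data1) {
            index.put(line.split("::")[0], line);
        }

        // Probe the index with each record of the second dataset
        List<String> joined = new ArrayList<>();
        for (String line : data2) {
            String match = index.get(line.split("::")[0]);
            if (match != null) {
                joined.add(match + "\t" + line);
            }
        }
        return joined;
    }
}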

1 Approaches

1.1 reduce join

In the map phase, emit the join key as the output key and tag each value to indicate whether it came from data1 or data2. Because the shuffle phase already groups records by key, the reduce phase only has to check each value's tag, separate the values into two groups, and emit the cross product of the two groups.

This approach has two problems:

1. The map phase does not slim the data down at all, so the shuffle's network transfer and sorting are expensive.

2. The reduce side computes the cross product of two whole groups, which consumes a lot of memory and can easily cause an OOM.

1.2 map join

If one of the two datasets is small, load it entirely into memory and index it by the join key. Use the large file as the map input; every input pair passed to map() can then be joined conveniently against the small dataset already held in memory. Emit the joined result keyed by the join key; after the shuffle, the reduce side receives data that is already grouped by key and already joined.

This approach uses Hadoop's DistributedCache to distribute the small dataset to every compute node; each map task loads the small dataset into memory and indexes it by the join key.

The approach has an obvious limitation: one of the datasets must be small enough to fit in memory on the map side so that the join can be performed there.

1.3 Use a memory server to extend the nodes' memory

As a variant of the map join, store one of the datasets on a dedicated in-memory server; in the map() method, for every <key, value> input pair, fetch the matching record from the memory server by key and perform the join.
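A minimal sketch of this idea, assuming the small table has already been loaded into a Redis instance and the Jedis client is available; Redis, the host name "redis-host", and the field layout are example choices, not part of the jobs shown later in this post.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import redis.clients.jedis.Jedis;

public class MemoryServerJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    private Jedis jedis;
    private final Text outKey = new Text();
    private final Text outValue = new Text();

    @Override
    protected void setup(Context context) {
        // One connection per map task; the small table is assumed to be stored as key -> record
        jedis = new Jedis("redis-host", 6379);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("::");
        String joinKey = fields[1];             // e.g. the movie ID column of ratings.dat
        String cached = jedis.get(joinKey);     // fetch the matching record from the memory server
        if (cached != null) {                   // no match means no join output for this record
            outKey.set(joinKey);
            outValue.set(cached + "\t" + value.toString());
            context.write(outKey, outValue);
        }
    }

    @Override
    protected void cleanup(Context context) {
        jedis.close();
    }
}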

1.4 Use a BloomFilter to filter out records that cannot join

Build a BloomFilter in memory from one of the datasets. Before joining, test each key of the other dataset against the BloomFilter; if the key is not present, the record cannot produce a join result and can be skipped.
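A rough sketch with Hadoop's built-in BloomFilter (org.apache.hadoop.util.bloom); the keys, the vector size, and the hash count below are arbitrary example values.

import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class BloomFilterJoinSketch {

    public static void main(String[] args) {
        // Build the filter from the join keys of the smaller dataset
        BloomFilter filter = new BloomFilter(1 << 20, 5, Hash.MURMUR_HASH);
        String[] smallTableKeys = {"1", "2", "3"};
        for (String k : smallTableKeys) {
            filter.add(new Key(k.getBytes()));
        }

        // While scanning the larger dataset, skip keys that cannot join.
        // A BloomFilter never gives false negatives, so nothing joinable is lost;
        // occasional false positives only let a few useless records through.
        String candidateKey = "1193";
        if (filter.membershipTest(new Key(candidateKey.getBytes()))) {
            System.out.println(candidateKey + " may have a match, keep it");
        } else {
            System.out.println(candidateKey + " cannot join, skip it");
        }
    }
}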

1.5 Use the join package that MapReduce provides

The mapreduce library contains a package designed specifically for joins. I have not studied it yet and do not know how to use it; I am only recording it here as a reminder.

jar: hadoop-mapreduce-client-core.jar

package: org.apache.hadoop.mapreduce.lib.join
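As a reminder to myself, here is a rough, unverified sketch of how the driver is typically configured with CompositeInputFormat from that package; it requires both inputs to already be sorted by the join key and partitioned identically, and the paths below are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.join.CompositeInputFormat;

public class CompositeJoinDriverSketch {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Compose the join expression: "inner" can also be "outer" or "override"
        conf.set(CompositeInputFormat.JOIN_EXPR,
                CompositeInputFormat.compose("inner", KeyValueTextInputFormat.class,
                        new Path("/movie/input/movies"), new Path("/movie/input/ratings")));

        Job job = Job.getInstance(conf);
        job.setJarByClass(CompositeJoinDriverSketch.class);
        job.setInputFormatClass(CompositeInputFormat.class);
        // The mapper would then be declared as Mapper<Text, TupleWritable, ...>
        // and read one value per joined input from the TupleWritable.
    }
}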

2 Implementing the reduce join

Two files are used; only a few sample records are shown here. The test data contains 3,883 records in movies.dat and 1,000,210 records in ratings.dat.

movies.dat format: 1::Toy Story (1995)::Animation|Children's|Comedy

Fields: movie ID, movie name, movie genres

ratings.dat format: 1::1193::5::978300760

Fields: user ID, movie ID, rating, rating timestamp

Code that joins the two files:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MovieMR1 {

    public static void main(String[] args) throws Exception {

        Configuration conf1 = new Configuration();
        /*conf1.set("fs.defaultFS", "hdfs://hadoop1:9000/");
        System.setProperty("HADOOP_USER_NAME", "hadoop");*/
        FileSystem fs1 = FileSystem.get(conf1);

        Job job = Job.getInstance(conf1);

        job.setJarByClass(MovieMR1.class);

        job.setMapperClass(MoviesMapper.class);
        job.setReducerClass(MoviesReduceJoinReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        Path inputPath1 = new Path("D:\\MR\\hw\\movie\\input\\movies");
        Path inputPath2 = new Path("D:\\MR\\hw\\movie\\input\\ratings");
        Path outputPath1 = new Path("D:\\MR\\hw\\movie\\output");
        if (fs1.exists(outputPath1)) {
            fs1.delete(outputPath1, true);
        }
        FileInputFormat.addInputPath(job, inputPath1);
        FileInputFormat.addInputPath(job, inputPath2);
        FileOutputFormat.setOutputPath(job, outputPath1);

        boolean isDone = job.waitForCompletion(true);
        System.exit(isDone ? 0 : 1);
    }

    public static class MoviesMapper extends Mapper<LongWritable, Text, Text, Text> {

        Text outKey = new Text();
        Text outValue = new Text();
        StringBuilder sb = new StringBuilder();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {

            // Use the name of the file this split belongs to in order to tag each record
            FileSplit inputSplit = (FileSplit) context.getInputSplit();
            String name = inputSplit.getPath().getName();
            String[] split = value.toString().split("::");
            sb.setLength(0);

            if (name.equals("movies.dat")) {
                // 1::Toy Story (1995)::Animation|Children's|Comedy
                // Fields: movie ID, movie name, movie genres
                outKey.set(split[0]);
                StringBuilder append = sb.append(split[1]).append("\t").append(split[2]);
                String str = "movies#" + append.toString();
                outValue.set(str);
                context.write(outKey, outValue);
            } else {
                // 1::1193::5::978300760
                // Fields: user ID, movie ID, rating, rating timestamp
                outKey.set(split[1]);
                StringBuilder append = sb.append(split[0]).append("\t").append(split[2]).append("\t").append(split[3]);
                String str = "ratings#" + append.toString();
                outValue.set(str);
                context.write(outKey, outValue);
            }
        }
    }

    public static class MoviesReduceJoinReducer extends Reducer<Text, Text, Text, Text> {
        // Holds movie name and genres for the current movie ID
        List<String> moviesList = new ArrayList<>();
        // Holds user ID, rating and timestamp records for the current movie ID
        List<String> ratingsList = new ArrayList<>();
        Text outValue = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {

            // Separate the values into two groups according to their source tag
            for (Text text : values) {
                if (text.toString().startsWith("movies#")) {
                    moviesList.add(text.toString().split("#")[1]);
                } else if (text.toString().startsWith("ratings#")) {
                    ratingsList.add(text.toString().split("#")[1]);
                }
            }

            // Cross product of the two groups
            int moviesSize = moviesList.size();
            int ratingsSize = ratingsList.size();

            for (int i = 0; i < moviesSize; i++) {
                for (int j = 0; j < ratingsSize; j++) {
                    outValue.set(moviesList.get(i) + "\t" + ratingsList.get(j));
                    // Final output: movie ID, movie name, movie genres, user ID, rating, timestamp
                    context.write(key, outValue);
                }
            }

            moviesList.clear();
            ratingsList.clear();
        }
    }
}

Final joined output: movie ID, movie name, movie genres, user ID, rating, timestamp


3 Implementing the map join

The same two input files are used as in section 2: movies.dat (3,883 records) and ratings.dat (1,000,210 records), with the same formats.

Requirement: find the 10 movies that were rated the most times, and output the number of ratings for each (movie name, rating count).

Implementation code. The solution uses two MapReduce jobs: MovieMR1_1 performs the map-side join against the cached movies.dat and counts the ratings per movie; MovieMR1_2 then sorts that output in descending order of rating count, using the custom MovieRating key, and keeps the top 10.

MovieRating.java

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Custom key type: movie name plus rating count, sorted by count in descending order
public class MovieRating implements WritableComparable<MovieRating> {
    private String movieName;
    private int count;

    public String getMovieName() {
        return movieName;
    }
    public void setMovieName(String movieName) {
        this.movieName = movieName;
    }
    public int getCount() {
        return count;
    }
    public void setCount(int count) {
        this.count = count;
    }

    public MovieRating() {}

    public MovieRating(String movieName, int count) {
        super();
        this.movieName = movieName;
        this.count = count;
    }

    @Override
    public String toString() {
        return movieName + "\t" + count;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        movieName = in.readUTF();
        count = in.readInt();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(movieName);
        out.writeInt(count);
    }

    @Override
    public int compareTo(MovieRating o) {
        // Larger counts sort first (descending order)
        return o.count - this.count;
    }
}

MovieMR1_2.java

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MovieMR1_2 {

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            args = new String[2];
            args[0] = "/movie/output/";
            args[1] = "/movie/output_last/";
        }

        Configuration conf1 = new Configuration();
        conf1.set("fs.defaultFS", "hdfs://hadoop1:9000/");
        System.setProperty("HADOOP_USER_NAME", "hadoop");
        FileSystem fs1 = FileSystem.get(conf1);

        Job job = Job.getInstance(conf1);

        job.setJarByClass(MovieMR1_2.class);

        job.setMapperClass(MoviesMapJoinRatingsMapper2.class);
        job.setReducerClass(MovieMR1Reducer2.class);

        job.setMapOutputKeyClass(MovieRating.class);
        job.setMapOutputValueClass(NullWritable.class);

        job.setOutputKeyClass(MovieRating.class);
        job.setOutputValueClass(NullWritable.class);

        Path inputPath1 = new Path(args[0]);
        Path outputPath1 = new Path(args[1]);
        if (fs1.exists(outputPath1)) {
            fs1.delete(outputPath1, true);
        }
        // Sort the output of the first job in descending order of rating count
        FileInputFormat.setInputPaths(job, inputPath1);
        FileOutputFormat.setOutputPath(job, outputPath1);

        boolean isDone = job.waitForCompletion(true);
        System.exit(isDone ? 0 : 1);
    }

    // Note: the map output key is the custom MovieRating object, which sorts by count in descending order
    public static class MoviesMapJoinRatingsMapper2 extends Mapper<LongWritable, Text, MovieRating, NullWritable> {

        MovieRating outKey = new MovieRating();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Input line from job 1, e.g.: 'Night Mother (1986)	70
            String[] split = value.toString().split("\t");

            outKey.setMovieName(split[0]);
            outKey.setCount(Integer.parseInt(split[1]));

            context.write(outKey, NullWritable.get());
        }
    }

    // Keys arrive already sorted; emit only the first 10 movies
    public static class MovieMR1Reducer2 extends Reducer<MovieRating, NullWritable, MovieRating, NullWritable> {

        int count = 0;

        @Override
        protected void reduce(MovieRating key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {

            for (NullWritable value : values) {
                count++;
                if (count > 10) {
                    return;
                }
                context.write(key, value);
            }
        }
    }
}

MovieMR1_1.java

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MovieMR1_1 {

    public static void main(String[] args) throws Exception {

        if (args.length < 4) {
            args = new String[4];
            args[0] = "/movie/input/";
            args[1] = "/movie/output/";
            args[2] = "/movie/cache/movies.dat";
            args[3] = "/movie/output_last/";
        }

        Configuration conf1 = new Configuration();
        conf1.set("fs.defaultFS", "hdfs://hadoop1:9000/");
        System.setProperty("HADOOP_USER_NAME", "hadoop");
        FileSystem fs1 = FileSystem.get(conf1);

        Job job1 = Job.getInstance(conf1);

        job1.setJarByClass(MovieMR1_1.class);

        job1.setMapperClass(MoviesMapJoinRatingsMapper1.class);
        job1.setReducerClass(MovieMR1Reducer1.class);

        job1.setMapOutputKeyClass(Text.class);
        job1.setMapOutputValueClass(IntWritable.class);

        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(IntWritable.class);

        // Cache the small file to the working directory of every task node
        URI uri = new URI("hdfs://hadoop1:9000" + args[2]);
        System.out.println(uri);
        job1.addCacheFile(uri);

        Path inputPath1 = new Path(args[0]);
        Path outputPath1 = new Path(args[1]);
        if (fs1.exists(outputPath1)) {
            fs1.delete(outputPath1, true);
        }
        FileInputFormat.setInputPaths(job1, inputPath1);
        FileOutputFormat.setOutputPath(job1, outputPath1);

        boolean isDone = job1.waitForCompletion(true);
        System.exit(isDone ? 0 : 1);
    }

    public static class MoviesMapJoinRatingsMapper1 extends Mapper<LongWritable, Text, Text, IntWritable> {

        // Holds the movies.dat data loaded into memory: movieId -> "movieName\tmovieType"
        private static Map<String, String> movieMap = new HashMap<>();
        // Output key: movie name
        Text outKey = new Text();
        // Output value: the rating
        IntWritable outValue = new IntWritable();

        /**
         * movies.dat: 1::Toy Story (1995)::Animation|Children's|Comedy
         *
         * Pre-load the small table (movies.dat) into memory.
         */
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {

            Path[] localCacheFiles = context.getLocalCacheFiles();

            String strPath = localCacheFiles[0].toUri().toString();

            BufferedReader br = new BufferedReader(new FileReader(strPath));
            String readLine;
            while ((readLine = br.readLine()) != null) {

                String[] split = readLine.split("::");
                String movieId = split[0];
                String movieName = split[1];
                String movieType = split[2];

                movieMap.put(movieId, movieName + "\t" + movieType);
            }

            br.close();
        }

        /**
         * movies.dat: 1::Toy Story (1995)::Animation|Children's|Comedy
         *             movie ID, movie name, movie genres
         *
         * ratings.dat: 1::1193::5::978300760
         *              user ID, movie ID, rating, rating timestamp
         *
         * value: a line read from ratings.dat
         */
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {

            String[] split = value.toString().split("::");

            String userId = split[0];
            String movieId = split[1];
            String movieRate = split[2];

            // Look up the movie name and genres in memory by movieId
            // (assumes every movieId in ratings.dat also exists in movies.dat)
            String movieNameAndType = movieMap.get(movieId);
            String movieName = movieNameAndType.split("\t")[0];
            String movieType = movieNameAndType.split("\t")[1];

            outKey.set(movieName);
            outValue.set(Integer.parseInt(movieRate));

            context.write(outKey, outValue);
        }
    }

    public static class MovieMR1Reducer1 extends Reducer<Text, IntWritable, Text, IntWritable> {
        // Number of ratings for the current movie
        int count;
        // Output value: the rating count
        IntWritable outValue = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {

            count = 0;

            for (IntWritable value : values) {
                count++;
            }

            outValue.set(count);

            context.write(key, outValue);
        }
    }
}

The final result lists the 10 most-rated movies together with their rating counts.
